A Model-based K-means Algorithm for Name Disambiguation

نویسندگان

  • Hui Han
  • Hongyuan Zha
  • C. Lee Giles
  • Leandro Dardini
  • Roberto Grossi
چکیده

Unambiguous identities of resources are important aspect for semantic web. This paper addresses the personal identity issue in the context of bibliographies. Because of abbreviations or misspelling of names in publications or bibliographies, an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of identity matching, document retrieval and database federation, and causes improper attribution of research credit. This paper describes a new K-means clustering algorithm based on an extensible Naïve Bayes probability model to disambiguate authors with the same first name initial and last name in the bibliographies and proposes a canonical name. The model captures three types of bibliographic information: coauthor names, the title of the paper and the title of the journal or proceeding. The algorithm achieves best accuracies of 70.1% and 73.6% on disambiguating 6 different “J Anderson” s and 9 different "J Smith" s based on the citations collected from researchers’ publication web pages.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Improved Name Disambiguation Method Based on Atom Cluster

An improved name disambiguation method based on atom cluster. Aiming at the method of character-related properties of similarity based on information extraction depends on the character information, a new name disambiguation method is proposed, and improved k-means algorism for name disambiguation is proposed in this paper. The cluster analysis cluster is introduced to the name disambiguation p...

متن کامل

A hybrid DEA-based K-means and invasive weed optimization for facility location problem

In this paper, instead of the classical approach to the multi-criteria location selection problem, a new approach was presented based on selecting a portfolio of locations. First, the indices affecting the selection of maintenance stations were collected. The K-means model was used for clustering the maintenance stations. The optimal number of clusters was calculated through the Silhou...

متن کامل

Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters or clustering is an important aspect of data analysis. It is the task of grouping a set of objects in such a way those objects in the same group/cluster are more similar in some sense or another. It is a main task of exploratory data mining, and a common technique for statistical data analysis This paper proposed an improved version of K-Means algorithm, namely Persistent K...

متن کامل

Clustering web people search results using fuzzy ants

Person name queries often bring up web pages that correspond to individuals sharing the same name. The Web People Search (WePS) task consists of organizing search results for ambiguous person name queries into meaningful clusters, with each cluster referring to one individual. This paper presents a fuzzy ant based clustering approach for this multi-document person name disambiguation problem. T...

متن کامل

بهبود صحت ابهام‌زدایی نام نویسنده با استفاده از خوشه‌بندی تجمّعی

Today, digital libraries are important academic resources including millions of citations and bibliographic essential information such as titles, author's names and location of publications. From the view of knowledge accumulation management, the ability to search fast, accurate, desired contents, has a great importance. The complexity and similarity in these resources cause many challenges and...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003